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Abstract Protein structural alignment is an important problem in compu- 
tational biology. In this paper, we present first successes on provably optimal 
pairwise alignment of protein inter-residue distance matrices, using the pop- 
ular DALI scoring function. We introduce the structural alignment problem 
formally, which enables us to express a variety of scoring functions used in 
previous work as special cases in a unified framework. Further, we propose the 
first mathematical model for computing optimal structural alignments based 
on dense inter-residue distance matrices. We therefore reformulate the prob- 
lem as a special graph problem and give a tight integer linear programming 
model. We then present algorithm engineering techniques to handle the huge 
integer linear programs of real-life distance matrix alignment problems. Ap- 
plying these techniques, we can compute provably optimal dali alignments 
for the very first time. 

Keywords Protein structure distance matrix alignment • Algorithm 
engineering • Integer linear programming • Branch-and-cut • Preprocessing • 
DALI 



1 Introduction 

Proteins are chains of amino acid residues that fold into complex three-dimen- 
sional (3D) structures. These structures largely determine the biological func- 
tion of a protein. Structural alignment is the problem of detecting structural 
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similarities between two proteins — similarities that indicate, for example, a 
similar biological function or a common evolutionary origin. Due to numerous 
applications in biology and medicine, it is an extremely important problem in 
computational biology. Unlike the related problem of sequence alignment, for 
which polynomial algorithms exist, structural alignment is computationally 
difficult. Current structural alignment algorithms fall in one of two categories: 
(i) they aim at achieving a low RMSD (root mean square deviation) of the 
superpositioned substructures for, ideally, a large number of aligned residues, 
or, (ii) they are based on aligning the inter-residue distances of the two pro- 
teins. Given the distances between all residues of a protein, its 3D structure 
can, except for chirality, be precisely reconstructed [3]. Inter-residue distance 
matrices are thus well-suited and widely used representations to align protein 
structures, and used in numerous programs. 



From an algorithmic perspective, structural alignment algorithms can be 
divided into heuristics [71PT1ITB] or exact methods pi lSlfTTll^l^ . While the 
first are typically fast, they do not provide a theoretical quality guarantee of 
their results, let alone for optimality. Exact methods compute provably opti- 
mal solutions, but may suffer from performance problems. Among the exact 
approaches, methods that maximize the number of short aligned distances, 
or the contact map overlap (CMO), have been shown to be the fastest [H[7]. 
However, the weak point of CMO approaches is that the scoring function is 
too basic [23]. The other extreme is the sophisticated dali (Distance-matrix 
ALignment) scoring function Targeting higher precision, it takes into con- 
sideration all inter-residue distances and assigns individual scores. Yet, because 
of the computational complexity of structural alignment and the huge size of 
the problem instances, it is extremely difficult to design an exact algorithm 
that uses this or similarly complex scoring functions. 



In this paper, we present first successes on computing provably optimal 
alignments for complete inter-residue distance matrices using the dali scoring 
function. We decide to focus on dali, because it is a widely used structural 
alignment program with a public web service and the accuracy of its alignments 
is higher or among those of the top methods in various benchmarks P^I19) . 
We introduce the structural alignment problem formally in Section [21 enabling 
us to express a variety of scoring functions used in previous work as special 
cases in a unified framework. In Section [3| we propose the first mathematical 
model for computing optimal distance matrix alignments. We therefore refor- 
mulate the problem as a special graph problem and give a tight integer linear 
programming (ILP) model. We present algorithm engineering techniques to 
solve this model to optimality for practical instance sizes (Section [Jj). Finally, 
in Section [21 we discuss our computational results that include first provably 
optimal DALI alignments for non-trivial structural similarities. 
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2 Problem Formulation 

We now formally define the problem of aligning two protein structure distance 
matrices. Our general definition will allow us to express many related problems 
that have been addressed in previous work as special cases. 

A distance matrix I? of a protein of length n is a symmetric n x n matrix. 
Rows and columns correspond to individual residues of the protein, in the order 
they appear in the protein sequence. The entries Dij, 1 < i < n, 1 < j < n, 
denote the distances between residues i and j. Many variants are possible, e.g., 
distances between the Ca or Cjs atoms, or the minimum distance between any 
atoms of the residues. 

An alignment of two distance matrices A and B of two proteins of length 
UA and ns is a matching of a subset of residues from protein A with a subset 
of residues from protein B that respects the sequential order of residues in 
their respective chain. More precisely, it is a pair of sequences of indices (/, J) 
with / = (^1,^2, . . . ,«„^) and J = {.ji, 32, ■ ■ ■ , Jul) satisfying |/| = |J| = ul, 
1 < zi < i2 < . . . < iriL ^ '^A, and 1 < ji < ^2 < ■ • • < jm, < nB, with the 
interpretation that residue ip is aligned to residue jp for p = 1 , . . . , . Such 
an alignment (/, J) induces an alignment of pairs of inter-residue distances 
and we define its score as 

SiI,J) :-^^s(A.„...,i3j„,jJ + ^gap(/,J), (1) 

O— 1 p—1 

where s : the scoring term of aligning individual inter- 

residue distances from protein A with distances from protein B and S'gap(/, J) 
is a function penalizing the gaps of the alignment (/, J). We denote by A{A, B) 
the set of all possible alignments of two distance matrices A and B. 

The problem consists now in finding the best alignment between the two 
matrices with respect to S. It is JVP-hard, following from [15,, since variable 
length gaps are admitted into the alignment and interactions between amino 
acid residues from the sequence are admitted into the scoring function. 

Problem 1 (Distance matrix alignment) Given two distance matrices A 
and B, find an alignment (/, J)* of A and B with 

(/, J)* = argmax S{I,J) . 

{LJ)eA(A,B) 

A large number of algorithms exists that address the distance matrix align- 
ment problem using a variety of scoring functions, all of which can be expressed 
in our framework. In the following we present some of the most popular scoring 
schemes. 

Contact map overlap (CMO). Based on a distance threshold r, two residues 
are defined to be in contact or not. Many algorithms have been presented 
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151125) that aim at finding an alignment with the maximum number of shared 
contacts. This corresponds to S'gap(/, J) = and 



s{Aij,Bkj) = 



1 Ai,j ^ T and Bk^i < t 
otherwise, 



for I < i < j < riA and I < k < I < ub and some r G R-°. 



DAST scoring. Motivated by the fact that maximizing shared contacts of- 
ten generates ahgnments with large RMSD, the DAST algorithm |T7] focuses 
on local alignments with low RMSD. This corresponds to aligning cliques 
of distances with low mutual distance differences and can be expressed as 
5'gap(/, J) = and 



1 \A,^j-BkA<T 
— oo otherwise, 



for \ < i < j < riA and \ < k < I < ub and some r G 



DALI scoring. DALI is a popular heuristic for protein structural align- 
ment. Expressed in our model ^ its scoring function [51[TT| corresponds to 
^gap(/, J) = and 



r(0.2- J^^:i^)e-(^(-4.,,+B.,,)/20) , ^ ^ and fc ^ Z 
0.2 otherwise. 



In our experimental results we will focus on the DALI scoring scheme. 



MATRAS scoring. Analogous to sequence substitution matrices, the structural 
alignment algorithm matras [13], uses log-odds value matrices M as struc- 
tural scores. A value My^^^ y4^\^ indicates the log-likelihood that distance di is 
aligned to distance d2. A positive log-likelihood means that the correspond- 
ing distances are aligned more likely than expected by chance. In our scoring 
model, MATRAS scoring can be realized by using function Sg^pil, J) for afhne 
gap costs and s{Aij,Bk.j) = MyA.ji.iBk.il- 



Protein threading problem (FTP). It is interesting to note that local PTP |B] 
can also be modeled in this framework with similar log-odds-based scoring 
functions JB: as matras. 
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SSAP scoring. SSAP [22j is an early approach to structural alignment and uses 
a structural scoring function that can be expressed as 



[ — (200 niin{n^, ns}) 2 otherwise, 

where and V^f; are vectors between residue i and j in protein A and k 
and I in protein B, resp. Using the difference of vectors for scoring, SSAP can 
account for the directionality of the compared inter-residue distances. Function 
5'gap(/, J) assigns a penalty of —5 for each gap, independent of the gap's length. 
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3 Mathematical Model 

We now present a mathematical model for structural alignment of protein 
distance matrices. To our knowledge, it is the first model for optimal protein 
structural alignment in a general setting as defined above. First, we reformulate 
the alignment problem in a graph-theoretical framework: An alignment graph 
G = (y, E) of two proteins of size ua and is a x ns grid graph in which 
each node i.k represents a possible alignment of a residue i in protein A to 
a residue k in protein B, see also Figure [TJ Each row of G, from bottom to 
top, corresponds to a residue in protein A, and each column, from left to right, 
corresponds to a residue in protein B. Edges in G represent possible alignment 
of pairs of distances. An edge {i.k,j.l) denotes the possible alignment of the 
distance between residues i and j in protein A with the distance between k 
and I in protein B. Alignment graphs for structural alignment have first been 
used for CMO [2], here we adapt them to a more general setting. 



i. j i. j 




I k,l I k,l 

1 2 3 4 1 2 3 4 

Fig. 1 Alignment graph for nodes that remain after preprocessing. Left: the set of nodes 
{1.1, 2.2, 3.4} forms a strictly increasing path that induces a structural alignment with score 
a + fi + f plus the score of the nodes. Here, a = 2s(Ai_2, Si, 2), /3 = 2s(A2,3, 82,4), 7 = 
2s(Ai,3, i3l_4), and the node score , Bn) -|- s(yl2,2 , ^2,2) -I- ^(^3,3, ^4,4). Right: Nodes 

on the decreasing path {4.1, 4.2, 3.2, 2.2, 2.3, 1.3, 1.4} mutually contradict. 
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We say a node j.l is strictly larger than i.k if and only ii j > i and I > k and 
strictly smaller than i.k ii and only if j < i and I < k. Edges are directed: any 
edge in the alignment graph has the property that its tail is strictly smaller 
than its head. It is thus oriented from south-west to north-east (cf. Figure [1]). 
Because of the partial ordering between nodes and because there exist only 
edges between ordered nodes, the alignment graph is a directed acyclic graph. 
An alignment is a subset {zi.fci, 12.^2, . . .} of nodes that can be ordered such 
that each node is strictly larger than the previous one, i.e., ii < ^2 < ■ • • and 
ki < k2 < . . .] we then say the nodes lie on a strictly increasing path. There is 
a one-to-one correspondence between alignments and strictly increasing paths. 
Every structural alignment can be represented by nodes on a strictly increasing 
path together with all induced edges {i.e. the associated augmented path). Its 
score is the sum of edge and node weights (cf. Figure [TJ left). Determining the 
structural alignment of maximum score is equivalent to the maximum weight 
augmented path problem in the corresponding alignment graph. 

Nodes in the alignment graph contradict if they are not on a strictly 
increasing path. A decreasing path contains a set of mutually contradicting 
nodes. That is, it is a set {ii./ci, 12.^:2, .. .} of nodes with ii > ^2 > ■ • • and 
ki < k2 < . . . (cf. Figure HJ right). We denote the set of all decreasing paths 
by C. Strictly smaller and strictly larger nodes exist only in strictly increasing 
paths and not in decreasing paths. 

We introduce some further notation describing the neighborhoods of a node 
i.k. By V~{i.k) we denote the set of nodes that are strictly smaller than i.k 
(left neighborhood), and by V~^{i.k) the set of nodes strictly larger than i.k 
(right neighborhood). The set contains all decreasing paths in V~{i.k) 
and the set C^^. all decreasing paths in V~^{i.k). 

Our integer linear programming formulation uses two types of variables. Bi- 
nary variables Xik represent nodes in the alignment graph and indicate whether 
residues i and k are aligned. Variables yikji denote whether the alignment 
graph edge {i.k, j.l) is present in the solution, i.e. whether distance Aij is 
aligned with distance Bk.i. 

riA — l riA riB—l ns tla riB 

max X! X! X! X! '^^i^i,jBk,i)yikji +'^'^s{Ai,iBk,k)xik (2) 

i=l j=i+l k=l l=k+l 1=1 fe=l 

s.t. Xik > ^ '^i^fe'* [i'*^^ - e [l,"i3 - 1] (3) 
j.iec 

Xik > yjiii' ^ '^i^fe' * ^ i^A],ke [2, ub] (4) 
j.iec 

a^ifc < 1 +^(2/ifej7 - a;^/) VC G C+j., i G [1, - 1], A; G [1, ns - 1] (5) 

s(A,j,Bkj)<0 

Y ^^k<l yCeC (6) 
i.kec 

X binary, y > (7) 
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Note that the objective function ^ of this ILP only models the structural 
part of dH), and not S'gap(/, J). It is possible to integrate also linear or affine 
gap costs. For the sake of simplicity, we do not describe this here but refer 
to [T] where this is done for multiple sequence alignment. 

Constraints ([B]) guarantee that the x variables form a proper alignment, 
i.e., a strictly increasing path in G. Constraints ^ and (U) link x and y vari- 
ables. They prevent activating edges for which source or target node are not 
activated as well. Similarly, inequalities (O force the activation of edges whose 
endpoints are activated. This is a novel class of constraints and necessary be- 
cause edges (i.fc, j.l) with s{Aij, Bk,i) < would otherwise never be part of an 
optimal solution. Note that all inequality classes have exponential size. Every 
feasible solution of model (H])-® is a structural alignment and constitutes in 
the alignment graph a strictly increasing path with its induced edges. 

4 Algorithm Engineering 

In this section, we describe several algorithm engineering techniques to solve 
the integer linear program from the previous section to optimality for practi- 
cally relevant instance sizes. We tried out several approaches, including solv- 
ing a less tight but polynomially-sized model, developing a branch-and-cut 
approach for the full model, combinations of the two models, several prepro- 
cessing techniques, and a divide-and-conquer approach. Due to space limita- 
tions, we only describe the most successful approach here, which is a variable 
elimination preprocessing step followed by a branch-and-cut approach for the 
full model. 



4.1 Preprocessing Using Variable Elimination 

We reduce the number of alignment graph nodes by over-estimating the score 
of the best structural alignment including the respective node. If this overesti- 
mated score is less than the score of a known feasible solution, the correspond- 
ing node and its adjacent edges are fathomed, i.e., deleted from the alignment 
graph. The efficiency of this process depends on the quality of the used lower 
bound. We are currently using the heuristic solution provided by DALI [llj . 

In order to compute overestimations we use double dynamic programming, 
similar as in [5]. In the first, local, dynamic program, we compute a profit pi,k 
for any node i.k G V, which corresponds to the largest value that this node 
can add to the global score. To compute pi,k we focus only on incoming and 
outgoing edges from i.k, and consider only those with score s{Aij, Bk^i) > 0. 
These scores are then assigned to the corresponding nodes j.l G V~{i.k) and 
V+(i.fc), respectively. The weight of the heaviest strictly increasing path 
in V^{i.k) (resp. p~f, in V~{i.k)) can be computed in time proportional to the 
size of the rectangles V~^{i.k) (resp. V~{i.k)) using dynamic programming. 
Finally, we set pi,k = p^^ + Ptk- 
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i a 1 a 

k k n k p k n p 

Fig. 2 Different computations for profit pi of node i.k. In black the feasible nodes. Left: 
Coarse profit computation. In black V~{i.k) and V^{i.k). Second from left: Fine profit 
computation for filtering node m.n when i.k < m.n. stays unchanged, but p'f^. can 
be recomputed by considering only nodes in V+(i.fc) H (V~(m.n) (J V"'"(m.n)) . Second from 
right: The same fine profit computation for filtering o.p. For filtering edge (m.n, o.p), we 
can take the minimum of the two cases. Right: Profit computation for i.k for fine filtering 
of edge {m.n, o.p). 

The second level, global, dynamic program consists in overestimating any 
solution containing a given node m.n, i.e., in which Xm.n = 1- For this purpose 
we associate profits p^.fc to all nodes, and using them as weights we compute the 
heaviest strictly increasing path in V~ (m.n) (respectively in {m.n)). Adding 
the weights of these two paths, as well as Pm.m we obtain an overestimated 
score for a node m.n. 

The efficiency of this variable elimination depends on the preciseness of the 
profit computation. The coarsest but also fastest method to potentially elimi- 
nate a node m.n is to use the previously computed profits pi,k. The advantage is 
that the profits can be computed once for all nodes in the beginning. However, 
better upper bounds can be obtained by using profits p^fc" ^^^^ computed 
in the same way as pi,k but considering only nodes in V~ (m.n) IJ (m.n). 
This fine filtering gives tighter overestimations, but is more time-consuming. 

Besides eliminating nodes, we can also eliminate edges. We implemented 
two ways of accomplishing this. One way, the so called coarse filtering of edge 
(m.n, o.p), uses the same profits as for coarse node filtering, cf. Figure [2] left. 
We can do better by re-using the profits p™^" and p° f computed during the 
fine filtering for nodes (m.n) and (o.p). Instead of Pi.k, we use in this case 
updated profits pi,k defined as pi,k = min(p™^;",p°;^). If p™^" and p°]^ are 
stored during the fine node filtering, such a finer edge filtering is then not 
more expensive than coarse filtering. 



4.2 Branch-and-Cut 

We use a cutting plane algorithm [21] within a CPLEX branch-and-cut frame- 
work to solve ILP ([I])-©. We first solve the LP relaxation of an initial ILP 
(with reduced number of constraints ©"(El))- In case of fractional solution, we 
solve a separation problem that yields a violated inequality cutting off this so- 
lution. In practice, often several of such cutting planes are generated. The new 
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constraints are added to the relaxed ILP problem which is solved again, and 
so on. If no more cutting planes can be found, the formulation comprising the 
new constraints is solved by CPLEX within a branch-and-bound algorithm. 

We now show how cutting planes for all types of constraints in model 
([7]) can be generated using Dijkstra's (or general shortest path) algorithm in 
a directed acyclic graph. For inequalities © this is identical to [TB]. For the 
remaining inequalities the principle is the same. 

Cutting planes for inequalities ^ can be generated efficiently by assigning 
value Xik obtained by the linear relaxation as weight to each alignment graph 
node. We can determine the most violated constraint by computing the maxi- 
mum weight decreasing path C in the alignment graph nodes V. If this weight 
is greater than one, the nodes on the decreasing path violate constraint ©. 
Since node weights are between zero and one, we can always lift a constraint 
of type ^ by adding x-variables that refer to a corresponding decreasing 
path of maximum cardinality. Hence, we search for the maximum weight path 
amongst all decreasing paths of length {ua x ub) — 1- We reformulate this 
problem to a minimum weight path problem by assigning weights (1 — Xik) to 
the nodes. Since all weights are positive, we can use Dijkstra's algorithm to 
solve this problem. If for the path C found in this way J^i kec > 1 holds, 
it generates a cutting plane. 

Cutting planes of type ^ and ^ are generated analogous to cutting 
planes of type (O. For each node i.k we generate up to two cutting planes, one 
for incoming edges and one for outgoing edges. In the first case, we assign for 
fixed i.k the weights yjuk to each node j.l e V~{i.k), and in the second case 
we assign yiuji to each node j.l G V+(i.fc). Then we compute the maximum 
weight decreasing path in V~{i.k) and V'^{i.k), respectively. If this path has 
weight greater than Xik, we identified a cutting plane. The reformulation of 
the maximum weight path problem to a minimum weight path problem that 
can be solved with Dijkstra's algorithm is analogous to cutting plane genera- 
tion for constraints ([6]). In worst case we identify 2{nA — ~ 1) violated 
inequalities ([3]) and (|4]) in each iteration. 

Similarly, we identify violated activation constraints ([S|). For each node 
i.k, we assign the weight yikji — Xji to each node j.l g V^{i.k) with objective 
function coefficient s{Aij, Bk.i) < 0. Note that, because we consider only 
edges with negative objective function score, in this case we do not know the 
cardinality of the path beforehand. If the weight of the lightest decreasing 
path 4-1 is smaller than Xik, we identified a violated inequality Since 
we compute the minimum weight path in a directed acyclic graph with edge 
weights less than or equal to zero, we can not apply Dijkstra's algorithm. 
Instead we traverse all nodes in topological order, which is provided by sorting 
according to the above defined order on the nodes. A constraint of type ^ only 
cuts off the current solution if its Xik value is greater than zero. In practice, a 
good strategy is to generate cutting planes (O only for nodes i.k with Xik = 1, 
because the alignment is mainly determined by edges {i.k, j.l) with positive 
score s{Ai,j,Bk^i) and there are comparatively very few edges with negative 
score in a solution. 
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5 Results and Discussion 

5.1 Data Sets and Setup of Computational Study 

We evaluate variable elimination and the branch-and-cut approach on the 
Skolnick data set, a benchmark for clustering proteins. It consists of 40 pro- 
tein chains of length between 97 and 255 residues that belong to five different 
SCOP (structural classification of proteins) families. This data set has been ex- 
tensively used for the evaluation of CMO algorithms, e.g., in pi[51 ll2[[^ . dali 
skips hetero atom records and residues with incomplete backbone. Therefore 
we edited 7 PDB (protein data bank) files of the Skolnick data set; we changed 
heteroatom records to atom records and excluded residues that were for other 
reasons ignored during dali computation. Furthermore, in order to be in line 
with DALI, we consider only the first decimal place of Cq atom coordinates. 
The remaining computations were carried out with double precision leading to 
slight differences between the overall score reported by dali and the recom- 
puted score. 

To cluster the proteins in the Skolnick benchmark, we compute for each 
pair of proteins an alignment and a corresponding similarity score, resulting 
in 780 alignments. We compute them using dali and analyze the results, dali 
evaluates similarities using an empirical z-score which is based on the align- 
ment's dali score. A z-score above 8 yields good structural superpositions [lOj . 
In the Skolnick data set there are 164 alignments with z-score greater than 8. 
They all correspond to pairs of proteins from the same family and are con- 
sidered "easy" instances in [2] . All alignments between pairs of proteins from 
different families have a z-score of less than 4 (or dali detects no similarity). 
This gap between alignments with high z-score and alignments with low z- 
score is promoting the well-known fact that the Skolnick data set is a rather 
easy benchmark [7]. We focus in the subsequent analysis only on alignments 
with z-score greater than 8 because the poor performance of variable elimina- 
tion for dissimilar proteins renders it currently infeasible to fit the problems 
into memory. 

Furthermore, we use alignments from Sisy [5J|3], a more challenging data 
set that has been designed with the objective to provide difficult structural 
alignment instances. These alignments are individually inspected by experts 
and essentially manually created; therefore we consider them gold-standard 
reference alignments. 

In order to compute optimal DALI alignments using our branch-and-cut ap- 
proach, we use CPLEX version 12.1 and a maximum running time of 30 hours. 
Instead of generating cutting planes for inequalities ^ , we use a polynomial 
number of constraints as initial ILP, 



k i-l 

1=1 3 = 1 



(8) 
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These constraints from [2 describe the set of increasing paths. We then 
add cutting planes and (O within a CPLEX cutting plane callback 

function. Alignments have been computed on cluster nodes each equipped with 
two quad core 2.26 GHz Intel Xcon processors and 24 GB of main memory 
running 64 bit Linux. 

5.2 Importance of Preprocessing 

Real-life problem instances for aligning complete inter-residue distance ma- 
trices are huge, consisting of uatib x-variables and ("2"*) ("2^) 2/-variables. We 
therefore have to rely on a very effective preprocessing. In this work, we present 
variable elimination as preprocessing in order to minimize the number of vari- 
ables that have to be considered explicitly in the dali ILP model. The effec- 
tiveness of this preprocessing depends on the similarity of the two proteins; if 
we apply preprocessing to two identical proteins, only the x-variables denot- 
ing identical residues and the y-variables of corresponding pairs of distances 
remain. As expected, the percentage of x- variables eliminated during prepro- 
cessing thus correlates with the dali z-score of the alignment (correlation 
coefficient 0.91, data not shown). The percentage of eliminated x-variables 
can be used successfully for correct classification of the Skolnick data set. This 
is an indication that highly similar proteins are easy to align optimally — an 
observation first made with exact CMO alignment algorithms. 

Variable elimination has to be effective, firstly in order to reduce the model 
such that it fits into memory and secondly such that we do not have to add too 
many cutting planes. This is oftentimes feasible for the class of alignments that 
leads to good structural superposition (z-score greater than 8). For pairs of 
proteins with more subtle structural similarities, for example if only subsets of 
both proteins are structurally similar, this preprocessing fails. For these cases, 
the current variable elimination is a good first step, but it is not sufficient to 
reduce the model such that it can be handled by CPLEX. 



5.3 Optimal Distance Matrix Alignments 

Our branch-and-cut approach solves 75 of the 164 Skolnick alignments (46%) 
to provable optimality. The running times for the solved instances vary between 
24s and 70296s and depend on the number of y-variables after variable elim- 
ination. The biggest solved instance has little less than 3 million y-variables. 
For 23 protein pairs (14%), CPLEX runs out of memory before the time limit 
is reached. Those instances have more than 15 million pairs of distances. 

For 32 of the solved Skolnick instances (43%), the heuristic solution pro- 
vided by DALI was proven to be optimal. In the remaining instances, the op- 
timal solution improved the heuristic solution slightly (less than 2% improve- 
ment in DALI score). With improvements in this order of magnitude, dali 
might not fail to produce the optimal alignment, but determine a different 
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alignment because it possibly uses different precision during computations. 
Furthermore, the compared Skolnick proteins are very similar in length and 
structure and the optimal alignment usually globally aligns them with a few 
small gaps. Therefore a good performance of the dali program is expected. 
We assume that in more difficult instances, in which the optimal alignment is 
not global, there is more room to improve on the heuristic dali solution. 

We use small Sisy alignments in order to evaluate our branch-and-cut ap- 
proach on such more difficult instances. Four instances are solved. We prove 
the optimality of the heuristic dali solution for two of them and for two im- 
prove with our exact solution the heuristic dali score slightly (by less than 



For global alignments of structurally very similar proteins, e.g., Skolnick 
instances, the shift from a simple scoring function like CMO, to a more sophis- 
ticated scoring of all distances using for example the dali scoring function, 
shows little effect. Nonetheless, our previous evaluation of the complete Sisy 
data set illustrates that for detection of more complex structural similarities, 
basic scores as CMO are not always sufficient to obtain gold-standard reference 
alignments, and the DALI scoring function performs on average better [23) . 



Fig. 3 Alignment of laawA (gray) and IgxiE (pinlc), an instance of tlie Sisy set. Opti- 
mal superposition according to the respective alignment. Residues colored in dark tone are 
aligned, residues colored in light tone are unaligned. Left: The Sisy reference alignment (29 
aligned residues, RMSD of 1.14). Middle; The heuristic DALI alignment correctly aligns all 
residues of the reference alignment, but extends the alignment length to 50 (RMSD of 2.55). 
Right: The optimal CMO alignment; it correctly aligns 96.55% of the aligned residues of 
the reference alignment. Alignment length is 56, RMSD 4.25. Additional gaps are inserted. 
Overaligning and insertion of additional gaps leads to a low RMSD value. 



5.4 Conclusion and Future Directions 

The branch-and-cut approach presented in this paper is the first exact algo- 
rithm that offers the feature of penalizing the alignment of non-compatible 
distances, e.g., a small and a large distance. Furthermore, it illustrates for 
the first time that it is feasible to compute provably optimal alignments of 
complete inter-residue distance matrices by means of preprocessing followed 
by a dedicated branch-and-cut approach that is implemented within a general 
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purpose ILP solver. Especially, we were able to compute dali alignments to 
optimality, demonstrating that this popular and widely used heuristic struc- 
ture comparison method generates optimal or close-to-optimal alignments, at 
least for relatively similar and relatively small problem instances. 

The techniques we apply are currently successful only in the cases of sim- 
ilar proteins. Based on our experience, we believe that it is not realistic to 
provide exact solutions for all instances. On the other hand, all our results 
show that it is not necessary to consider all pairs of inter-residue distances 
in order to obtain good alignments. Methods from molecular distance geome- 
try [24] , for example, can uniquely reconstruct a protein's 3D structure from a 
small subset of distances. Current alignments of sparse inter-residue distance 
matrices might thus perform promising because they capture to large extend 
the protein's structure. Nonetheless, we observe that it is essential to penalize 
non-compatible distances by assigning them a negative score. An illustration is 
given in Figure [3j the CMO scoring function greedily aligns as many residues 
as possible, which leads to an increase of alignment length at the expense of 
precise structural similarity of compact substructures as measured by RMSD. 

Because of these observations, we plan to move towards structural align- 
ment scoring schemes that reduce the size of the model by excluding pairs of 
distances. A first successful, yet still non-negative, scoring scheme was pre- 
sented in [53]. Secondly, A_PURVA, the implementation from [5], is able to 
provide optimal CMO alignments even for very large and dissimilar instances. 
Based on these observations we will come up with hypotheses which pairs 
of distances can be excluded without losing alignment precision. Computing 
alignments to optimality, we can then objectively test these hypotheses. 
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